Spatial Statistics 2023: Climate and the Environment
EPA (USA)
St. Lawrence University (USA)
NOAA (USA)
2023-07-18
Michael Dumelle is a statistician for the United States Environmental Protection Agency (USEPA). He works primarily on facilitating the survey design and analysis of USEPA’s National Aquatic Resource Surveys (NARS), which characterize the condition of waters across the United States. His primary research interests are in spatial statistics, survey design, environmental and ecological applications, and software development.
Matt Higham is an assistant professor of statistics at St. Lawrence University, a small liberal arts college in Canton, New York. His primary research interests are in spatial statistics and in applications of statistics to ecological settings. He enjoys using R to teach undergraduate students the foundations of statistical computing and data science.
Jay Ver Hoef is a senior scientist and statistician for the Marine Mammal Lab, part of the Alaska Fisheries Science Center of NOAA Fisheries, located in Seattle, Washington, although Jay lives in Fairbanks, Alaska. He is a fellow of the American Statistical Association, and Jay consults on a wide variety of topics related to marine mammals and stream networks. His main statistical interests are in spatial statistics and Bayesian statistics, especially applied to ecological and environmental data.
The views expressed in this workshop are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency or the U.S. National Oceanic and Atmospheric Administration. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government, the U.S. Environmental Protection Agency, or the U.S. National Oceanic and Atmospheric Administration. The U.S. Environmental Protection Agency and the U.S. National Oceanic and Atmospheric Administration do not endorse any commercial products, services, or enterprises.
spmodel?spmodel is an R package to fit, summarize, and predict for a variety of spatial statistical models. Some of the things that spmodel can do include:
spmodel?There are many great spatial modeling packages in R. A few reasons to use spmodel for spatial analysis are that:
spmodel syntax is similar to base R syntax for functions like lm(), glm(), summary(), and predict(), making the transition from fitting non-spatial models to spatial models relatively seamless.spmodel capabilities that give the user significant control over the specific spatial model being fit.spmodel is compatible with other modern R packages like broom and sf.splm().The sulfate data in spmodel contains data on 197 sulfate measurements in the continental United States
Simple feature collection with 6 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 162932.8 ymin: 1080571 xmax: 914593.6 ymax: 1453636
Projected CRS: NAD83 / Conus Albers
sulfate geometry
1 12.925 POINT (817738.8 1080571)
2 20.170 POINT (914593.6 1295545)
3 16.822 POINT (359574.1 1178228)
4 16.227 POINT (265331.9 1239089)
5 7.858 POINT (304528.8 1453636)
6 15.358 POINT (162932.8 1451625)
We fit and summarize a spatial linear model with an intercept by running
Call:
splm(formula = sulfate ~ 1, data = sulfate, spcov_type = "exponential")
Residuals:
Min 1Q Median 3Q Max
-5.738 -2.605 4.900 13.323 38.099
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.924 6.529 0.907 0.364
Coefficients (exponential spatial covariance):
de ie range
85.8 10.4 3105165.7
broom FunctionsTidy the fixed effect output
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.92 6.53 0.907 0.364
Glance at the model fit
# A tibble: 1 × 9
n p npar value AIC AICc logLik deviance pseudo.r.squared
<int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 197 1 3 1140. 1146. 1146. -570. 196. 0
Augment the data with model diagnostics
broom FunctionsSimple feature collection with 197 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -2292550 ymin: 386181.1 xmax: 2173345 ymax: 3090370
Projected CRS: NAD83 / Conus Albers
# A tibble: 197 × 7
sulfate .fitted .resid .hat .cooksd .std.resid geometry
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <POINT [m]>
1 12.9 5.92 7.00 0.00334 0.00161 -0.694 (817738.8 1080571)
2 20.2 5.92 14.2 0.00256 0.00192 0.865 (914593.6 1295545)
3 16.8 5.92 10.9 0.00259 0.000395 0.390 (359574.1 1178228)
4 16.2 5.92 10.3 0.00239 0.000363 0.390 (265331.9 1239089)
5 7.86 5.92 1.93 0.00202 0.00871 -2.07 (304528.8 1453636)
6 15.4 5.92 9.43 0.00201 0.000240 0.345 (162932.8 1451625)
7 0.986 5.92 -4.94 0.00380 0.000966 -0.503 (-1437776 1568022)
8 0.425 5.92 -5.50 0.0138 0.00584 -0.646 (-1572878 1125529)
9 3.58 5.92 -2.34 0.00673 0.0000148 -0.0467 (-1282009 1204889)
10 2.38 5.92 -3.54 0.0123 0.0000139 -0.0335 (-1972775 1464991)
# ℹ 187 more rows
Augment prediction data
Simple feature collection with 100 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -2283774 ymin: 582930.5 xmax: 1985906 ymax: 3037173
Projected CRS: NAD83 / Conus Albers
# A tibble: 100 × 2
.fitted geometry
* <dbl> <POINT [m]>
1 1.62 (-1771413 1752976)
2 24.4 (1018112 1867127)
3 8.95 (-291256.8 1553212)
4 16.5 (1274293 1267835)
5 4.93 (-547437.6 1638825)
6 26.8 (1445080 1981278)
7 2.87 (-1629090 3037173)
8 14.3 (1302757 1039534)
9 1.53 (-1429838 2523494)
10 14.3 (1131970 1096609)
# ℹ 90 more rows
Visualize predictions
Say hi to your neighbors! What type of work or analyses do you do in the field of spatial statistics?
05:00
Visit spmodel's website at https://usepa.github.io/spmodel. Navigate to the “References” tab and explore some other functions available in spmodel.
05:00
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad(1)\]
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \qquad(2)\]
spmodel using splm() (the workshop’s focus)spmodel using spautor()
Navigate to the Help file for splm by running ?splm or by visiting this link and scroll down to “Details.” Examine the spatial linear model description in the Help file and relate some of the syntax used to the syntax used in this section.
05:00
With your neighbor(s), verify that you reach the same conclusions in the previous exercise, relating the syntax in the Help file to the syntax in the spatial linear model from this section.
03:00
splm() function using the moss data.splm() to the spatial linear model.The moss data contains a variable for log Zinc concentration for moss samples collected near a mining road in Alaska.
splm() functionThe splm() function shares syntactic structure with the lm() function and generally requires at least three arguments
formula: a formula that describes the relationship between the response variable (\(\mathbf{y}\)) and explanatory variables (\(\mathbf{X}\))data: a data.frame or sf object that contains the response variable, explanatory variables, and spatial information.spcov_type: the spatial covariance type ("exponential", "matern", "spherical", etc; 17 total types)Estimation Methods:
weights) (Cressie 1985)
Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")
Residuals:
Min 1Q Median 3Q Max
-2.6801 -1.3606 -0.8103 -0.2485 1.1298
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.76825 0.25216 38.74 <2e-16 ***
log_dist2road -0.56287 0.02013 -27.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.683
Coefficients (exponential spatial covariance):
de ie range
3.595e-01 7.897e-02 8.237e+03
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \qquad(3)\]
Another data set contained within the spmodel package is the caribou data set. Fit a spatial linear model with
z as the response and tarp, water, and the interaction between tarp and water as predictorsx as the xcoord and y as the ycoord
After fitting the model, perform an analysis of variance using anova() to assess the importance of tarp, water, and tarp:water.
07:00
Compare your anova model output to the anova model output of your neighbor(s). Are the inferences made on the fixed effects very different for your two choices of spatial covariance model?
03:00
The glance() function returns columns with the sample size (n), number of fixed effects (p), number of estimated covariance parameters (npar), optimization criteria minimum (value), AIC (AIC), AICc (AICc), log-likelihood (loglik), deviance (deviance), and pseudo R-squared (pseudo.r.squared)
The glances() function allows us to compare the model fit statistics for a few different models simultaneously. Use splm() to fit a model with a Matérn spatial covariance (by specifying "matern" for spcov_type) and a model with no spatial covariance (by specifying "none" for spcov_type). Then, use the glances() function, providing each model object as an argument, to compare the model fit statistics of each model.
05:00
The augment() function returns columns with
.fitted, the fitted value, calculated from the estimated fixed effects in the model.hat, the Mahalanobis distance, a metric of leverage.cooksd, the Cook’s distance, a metric of influence.std.resid, the standardized residualUse spmodel’s plot function on the spmod object to construct a plot of the fitted spatial covariance vs spatial distance. To learn more about the options for spmodel’s plot function, run ?plot.spmodel or visit this link.
03:00
Use spmodel’s plot function on the spmod object to construct any of the other 6 types of plots possible for this type of model. Then, explain what the plot is showing to your neighbor(s).
05:00
The moose data contains moose counts and moose presence for 218 spatial locations in Alaska.
The moose_preds data contains spatial locations that were not surveyed.
| elev | strat | geometry |
|---|---|---|
| 143.4000 | L | POINT (401239.6 1436192) |
| 324.4375 | L | POINT (352640.6 1490695) |
| 158.2632 | L | POINT (360954.9 1491590) |
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.310 9.02 0.0344 0.973
2 elev 0.0141 0.00806 1.76 0.0792
3 stratM 6.93 2.26 3.07 0.00217
4 elev:stratM -0.0273 0.0130 -2.10 0.0357
Using predict()
Using augment()
Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
elev strat .fitted geometry
* <dbl> <chr> <dbl> <POINT [m]>
1 143. L 3.45 (401239.6 1436192)
2 324. L 1.59 (352640.6 1490695)
3 158. L -0.267 (360954.9 1491590)
4 221. M 2.39 (291839.8 1466091)
5 209. M 7.62 (310991.9 1441630)
6 218. L -1.02 (304473.8 1512103)
7 127. L -1.23 (339011.1 1459318)
8 122. L -1.43 (342827.3 1463452)
9 191 L -0.239 (284453.8 1502837)
10 105. L 0.657 (391343.9 1483791)
# ℹ 90 more rows
Examine the help file ?augment.spmodel or by visiting this link and create site-wise 99% prediction intervals for the unsampled locations found in moose_preds.
05:00
The loocv() function can be used to perform leave-one-out cross validation on a fitted model object using mean-squared-prediction error (MSPE) loss:
Fit a model with count as the response variable from the moose data with a "spherical" spatial covariance model for the random errors but no predictors as fixed effects. Compare the MSPE from leave-one-out cross-validation for this model with the previously fit moosemod. Which model is better, according to the leave-one-out cross-validation criterion?
Then, for the model with the lower MSPE, obtain the leave-one-out cross validation predictions and their standard errors. Hint: run ?loocv or visit this link.
07:00
Incorporate additional arguments to splm() to:
Provide a vector of spcov_types:
spmods <- splm(formula = log_Zn ~ log_dist2road, data = moss,
spcov_type = c("exponential", "gaussian"))
names(spmods)[1] "exponential" "gaussian"
Natural to combine with glances() and predict():
Work with a neighbor to find 90% confidence intervals for the fixed effects in the Gaussian model with one of two functions: (1) tidy() or (2) confint(). Before beginning, decide with your neighbor who will work finding the intervals with (1) tidy() and who will work on finding the intervals with (2) confint().
05:00
| sample | log_dist2road | log_Zn | geometry |
|---|---|---|---|
| 001PR | 2.68 | 7.33 | POINT (-413585.3 1997623) |
| 001PR | 2.68 | 7.38 | POINT (-413585.3 1997623) |
| 002PR | 2.54 | 7.58 | POINT (-415367.2 1996769) |
| 003PR | 2.97 | 7.63 | POINT (-417186 1995645) |
random ArgumentIn splm(), the random argument follows similar syntax to how random effects are specified in the nlme and lme4 packages.
random = ~ group and random = (1 | group) specify random intercepts for each level of group.random = (x | group) specifies random intercepts for group and for each level of group to have a different slope for x.
Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential",
random = ~(1 | sample))
Residuals:
Min 1Q Median 3Q Max
-2.6234 -1.3228 -0.8026 -0.2642 1.0998
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.66066 0.26770 36.09 <2e-16 ***
log_dist2road -0.55028 0.02071 -26.58 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.6605
Coefficients (exponential spatial covariance):
de ie range
3.153e-01 2.094e-02 1.083e+04
Coefficients (random effects):
1 | sample
0.07995
Perhaps a model with random intercepts for sample and random intercepts and slopes for year but without any spatial covariance is an even better fit to the data. Fit such a model by specifying spcov_type to be "none". Then, use glances() to see how well this non-spatial model fits the moss data compared to the spatially explicit models.
05:00
An anisotropic covariance does not behave similarly in all directions
aniso <- splm(log_Zn ~ log_dist2road,
data = moss, spcov_type = "exponential",
anisotropy = TRUE)
glances(spmod, aniso)# A tibble: 2 × 10
model n p npar value AIC AICc logLik deviance pseudo.r.squared
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aniso 365 2 5 362. 372. 372. -181. 363. 0.705
2 spmod 365 2 3 367. 373. 373. -184. 363 0.683
Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential",
anisotropy = TRUE)
Residuals:
Min 1Q Median 3Q Max
-2.5279 -1.2239 -0.7202 -0.1921 1.1659
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.54798 0.22291 42.83 <2e-16 ***
log_dist2road -0.54601 0.01855 -29.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.7048
Coefficients (exponential spatial covariance):
de ie range rotate scale
3.561e-01 6.812e-02 8.732e+03 2.435e+00 4.753e-01
attr(,"class")
[1] "exponential"
Visualize the anisotropic level curve for spmod using plot(). Hint: Run ?plot.spmodel or visit this link.
03:00
Examine the anisotropy section of Details in the splm() help file with ?splm. Then, with your neighbor, discuss which direction the model predicts two responses will be more correlated?
05:00
A factor variable (i.e., categorical) is a partition factor when two observations in separate levels of the partition factor should be uncorrelated
Steps:
Use spcov_initial() to specify the covariance type and any known, fixed covariance parameters.
Instead of specifying the spcov_type argument in splm(), specify the spcov_initial argument with the object from (1).
Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_initial = init_spher)
Coefficients (fixed):
(Intercept) log_dist2road
9.7194 -0.5607
Coefficients (spherical spatial covariance):
de ie range
4.545e-01 8.572e-02 2.000e+04
Fit a "spherical" spatial covariance model to the moss data set without a nugget effect (i.e., the model should have the ie independent variance parameter set to 0 and treated as known). Verify in the summary output that the ie is indeed 0 for this model.
05:00
With a neighbor, compare the fit of the "spherical" model with the ie variance parameter known and fixed at 0 (the no nugget model from the previous exercise) with the fit of a "spherical" model where all spatial covariance parameters are unknown and are estimated using (1) the AIC metric and (2) the MSPE from leave-one-out cross-validation using the loocv() function.
Before beginning work, decide who will complete (1) AIC and who will complete (2) loocv().
05:00
See workbook for an example of prediction using random forest spatial residual models
local argument in splm() to fit a spatial linear model to a large data set.local argument in predict() (or augment()) to make predictions for a large data set.spmodel implements “local” spatial indexing (Ver Hoef, Dumelle, et al. 2023) for model fitting
local = TRUE in splm()
spmodel uses “local neighborhood prediction” to predict for unobserved spatial locations for a model fit to a large data set
local = TRUE in predict() or augment()
spglm() function in spmodel to fit generalized linear models for various model families (i.e., response distributions).\[ g(\boldsymbol{\mu}) = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \]
| Distribution | Data Type | Link Function |
|---|---|---|
| Poisson | Count | Log |
| Negative Binomial | Count | Log |
| Binomial | Binary | Logit |
| Beta | Proportion | Logit |
| Gamma | Skewed | Log |
| Inverse Gaussian | Skewed | Log |
| elev | strat | count | presence | geometry |
|---|---|---|---|---|
| 468.9 | L | 0 | 0 | POINT (293542.6 1541016) |
| 362.3 | L | 0 | 0 | POINT (298313.1 1533972) |
| 172.8 | M | 0 | 0 | POINT (281896.4 1532516) |
The family argument can be binomial, beta, poisson, nbinomial, Gamma, or inverse.gaussian
Notice the similarities between spglm() and glm()
Call:
spglm(formula = count ~ elev * strat, family = poisson, data = moose,
spcov_type = "spherical")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4245 -0.7783 -0.3653 0.1531 0.5900
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.230575 0.958201 -2.328 0.019919 *
elev 0.007623 0.003129 2.437 0.014820 *
stratM 2.752234 0.782853 3.516 0.000439 ***
elev:stratM -0.010248 0.004472 -2.292 0.021928 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.09573
Coefficients (spherical spatial covariance):
de ie range
3.892 1.163 51204.657
Coefficients (Dispersion for poisson family):
dispersion
1
Fit a spatial negative binomial model to the moose data with count as the response and elev, strat, and their interaction as predictors. The negative binomial model relaxes the assumption in the spatial Poisson generalized linear model that the mean of a response variable \(Y_i\) and the variance of a response variable \(Y_i\) must be equal. Obtain a summary of the fitted model. Then compare their fits using loocv(). Which model is preferable?
05:00
Using predict()
Using augment()
Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
elev strat .fitted geometry
* <dbl> <chr> <dbl> <POINT [m]>
1 143. L 0.207 (401239.6 1436192)
2 324. L -0.0563 (352640.6 1490695)
3 158. L -1.24 (360954.9 1491590)
4 221. M -1.16 (291839.8 1466091)
5 209. M 1.78 (310991.9 1441630)
6 218. L -1.84 (304473.8 1512103)
7 127. L -2.80 (339011.1 1459318)
8 122. L -2.45 (342827.3 1463452)
9 191 L -0.409 (284453.8 1502837)
10 105. L -1.10 (391343.9 1483791)
# ℹ 90 more rows
All advanced features available in spmodel for spatial linear models (splm()) are also available for spatial generalized linear models (spglm())
Use spglm() to fit a spatial logistic regression model to the moose data using presence as the response variable and a "cauchy" covariance function. Then, find the predicted probabilities that moose are present at the spatial locations in moose_preds (Hint: Use the type argument in predict() or augment()).
07:00
Simulate spatial Gaussian data using sprnorm().
Simulate spatial binary, proportion, count, and skewed data using sprbinom(), sprbeta(), sprpois(), sprnbinom(), sprgamma(), and sprinvgauss().
spcov_params() to specify the correlation structure and covariance parameterssprnorm(params, data) to simulate a realization of the spatial process defined by spcov_params() at locations in data.What does an empirical semivariogram of the simulated data look like?
Use sprpois(params, data) to simulate a Poisson realization and visualize
spmodel requires the Cholesky decomposition of the covariance matrix, which can take awhile for sample sizes exceeding 10,000samples) takes nearly the same time as simulating just one.Use the spautor() function in spmodel to fit a spatial linear model to areal data.
Apply functions used for point-referenced data fit with splm() to areal data fit with spautor().
Figure 8: Distribution of the seal data.
Model syntax for spautor() (spgautor()) is similar to the syntax used for splm() (spglm()):
Call:
spautor(formula = log_trend ~ 1, data = seal, spcov_type = "car")
Residuals:
Min 1Q Median 3Q Max
-0.34441 -0.10403 0.04423 0.07351 0.20489
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.07103 0.02492 -2.851 0.00436 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coefficients (car spatial covariance):
de range extra
0.03226 0.42020 0.02235
Choose a couple of helper functions that you would like to explore and apply those functions to the fitted seal model.
03:00
Interpret your findings from the previous exercise with your neighbor(s), explaining which functions you selected to use and what the associated output means.
05:00
Predictions must occur for “missing” (NA) responses in the observed data
Using predict()
Using augment()
Simple feature collection with 28 features and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 913618.8 ymin: 1007542 xmax: 1115097 ymax: 1132682
Projected CRS: NAD83 / Alaska Albers
# A tibble: 28 × 3
log_trend .fitted geometry
* <dbl> <dbl> <POLYGON [m]>
1 NA -0.115 ((1035002 1054710, 1035002 1054542, 1035002 1053542, 1035…
2 NA -0.00918 ((1043093 1020553, 1043097 1020550, 1043101 1020550, 1043…
3 NA -0.0603 ((1099737 1054310, 1099752 1054262, 1099788 1054278, 1099…
4 NA -0.0360 ((1099002 1036542, 1099134 1036462, 1099139 1036431, 1099…
5 NA -0.0724 ((1076902 1053189, 1076912 1053179, 1076931 1053179, 1076…
6 NA -0.0549 ((1070501 1046969, 1070317 1046598, 1070308 1046542, 1070…
7 NA -0.0976 ((1072995 1054942, 1072996 1054910, 1072997 1054878, 1072…
8 NA -0.0715 ((960001.5 1127667, 960110.8 1127542, 960144.1 1127495, 9…
9 NA -0.0825 ((1031308 1079817, 1031293 1079754, 1031289 1079741, 1031…
10 NA -0.0593 ((998923.7 1053647, 998922.5 1053609, 998950 1053631, 999…
# ℹ 18 more rows
Verify that the fitted autoregressive model with the seal data changes when the polygons with missing response values are excluded from the data argument in spautor(). The following code creates a data without the polygons with missing values:
05:00
spmodel Workshop